(54) Video and audio recording

(57) A camera-recorder apparatus comprises an image capture device
operable to capture a plurality of video images; a storage medium by
which the video images are stored for later retrieval; a feature
extraction unit operable to derive image property data from the image
content of at least one of the video images substantially in real time
at the capture of the video images, the image property data being
associated with respective images or groups of images; and a data path
by which the camera-recorder apparatus is operable to transfer the
derived image property data to an external data processing apparatus.
Description

[0001] The present invention relates to the field of video and audio
information processing.
[0002] Video cameras produce audio and video foot- 
age that will typically be extensively edited before a 
broadcast quality programme is finally produced. The 
editing process can be very time consuming and there- 
fore accounts for a significant fraction of the production 
costs of any programme. 

[0003] Video images and audio data will often be edited "off-line" on a
computer-based digital non-linear editing apparatus. A non-linear
editing system provides the
flexibility of allowing footage to be edited starting at any 
point in the recorded sequence. The images used for 
digital editing are often a reduced resolution copy of the 
original source material which, although not of broadcast quality, is
of sufficient quality for browsing the recorded material and for making
off-line editing decisions. The video images and audio data can be edited
independently. 

[0004] The end-product of the off-line editing process 
is an edit decision list (EDL). The EDL is a file that iden- 
tifies edit points by their timecode addresses and hence 
contains the required instructions for editing the pro- 
gramme. The EDL is subsequently used to transfer the 
edit decisions made during the off-line edit to an "on-line" edit in
which the master tape is used to produce a
high-resolution broadcast quality copy of the edited pro- 
gramme. 

[0005] The off-line non-linear editing process, although flexible, can
be very time consuming. It relies on the human operator to replay the
footage in real time, segment shots into sub-shots and then to arrange
the shots in the desired chronological sequence. Arranging the shots in
an acceptable final sequence is likely to entail viewing the shot,
perhaps several times over, to assess its overall content and consider
where it should be
inserted in the final sequence. 
[0006] The audio data could potentially be automati- 
cally processed at the editing stage by applying a 
speech detection algorithm to identify the audio frames 
most likely to contain speech. Otherwise the editor must 
listen to the audio data in real time to identify its overall
content. 

[0007] Essentially the editor has to start from scratch 
with the raw audio frames and video images and pains- 
takingly establish the contents of the footage. Only then 
can decisions be made on how shots should be segmented and on the
desired ordering of the final sequence.

[0008] The invention provides a camera-recorder apparatus comprising:

an image capture device operable to capture a plurality of video images;

a storage medium by which the video images are stored for later
retrieval;

a feature extraction unit operable to derive image property data from
the image content of at least one of the video images substantially in
real time at the capture of the video images, the image property data
being associated with respective images or groups of images; and

a data path by which the camera-recorder apparatus is operable to
transfer the derived image property data to an external data processing
apparatus.

[0009] The invention recognises that the time taken for a human editor
to review the material on a newly acquired video tape or the like places
a great burden on the editing process, slowing down the whole editing
operation. However, simply automating the review of the material at an
editing apparatus would not reap significant benefits. Although such a
simple automation would reduce the need for (expensive) human
intervention, it would not significantly speed up the process. This
factor is important in time-critical applications such as newsgathering.

[0010] In contrast, in the invention, by deriving data characteristic of
the image content substantially in real time at the camera-recorder
apparatus, the data is ready to be analysed much more quickly, and
without necessarily the need for a machine to review the entire video
material. This can dramatically speed up automated preparation for the
editing process.

[0011] Embodiments of the invention will now be described by way of
example only with reference to the accompanying drawings, in which:

Figure 1 shows a downstream audio and video processing system according
to embodiments of the invention;

Figure 2 shows a video camera and metastore according to embodiments of
the invention;

Figure 3 is a schematic diagram of a feature extraction module and a
metadata extraction module according to embodiments of the invention;

Figure 4 shows a video camera and a personal digital assistant according
to a first embodiment of the invention;

Figure 5 shows a camera and a personal digital assistant according to a
second embodiment of the invention;

Figure 6 is a schematic diagram illustrating the components of the
personal digital assistant according to embodiments of the invention; and

Figure 7 is a schematic diagram of an audio and video information
processing system according to embodiments of the invention.

[0012] Figure 1 shows a downstream audio-visual processing system
according to the present invention. A camera 10 records audio and video
data on videotape in the camera. The camera 10 also produces and records
supplementary information about the recorded video footage known as
"metadata". This metadata will typically include the recording date,
recording start/end flags or timecodes, camera status data and a unique
identification index for the recorded material known as an SMPTE UMID.

[0013] The UMID is described in the March 2000 issue of the "SMPTE
Journal". An "extended UMID" comprises a first set of 32 bytes of "basic
UMID" and a second set of 32 bytes of "signature metadata".
[0014] The basic UMID has a key-length-value (KLV) structure and it
comprises:

■ A 12-byte Universal Label or key which identifies the SMPTE UMID
itself and the type of material to which the UMID refers. It also
defines the methods by which the globally unique Material numbers and
locally unique Instance numbers (defined below) are created.

■ A 1-byte length value which specifies the length of
the remaining part of the UMID. 

■ A 3-byte Instance number used to distinguish be- 
tween different 'instances' or copies of material with 
the same Material number. 

■ A 16-byte Material number used to identify each 
clip. A Material number is provided at least for each 
shot and potentially for each image frame. 

[0015] The signature metadata comprises:

■ An 8-byte time/date code identifying the time of creation of the
"Content Unit" to which the UMID applies. The first 4 bytes are a
Universal Time Code (UTC) based component.

■ A 12-byte value which defines the (GPS derived) spatial co-ordinates
at the time of Content Unit creation.

■ 3 groups of 4-byte codes which comprise a country 
code, an organisation code and a user code. 
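
The byte layout listed above can be illustrated with a minimal sketch. It assumes that the fields appear in the byte stream in the order in which they are listed, which the description does not state explicitly; the function and field names are hypothetical and chosen only for illustration.

```python
# Field widths in bytes, taken from the description of the extended UMID.
# ASSUMPTION: fields are laid out in the order in which they are listed.
BASIC_FIELDS = (
    ("universal_label", 12),
    ("length", 1),
    ("instance_number", 3),
    ("material_number", 16),
)
SIGNATURE_FIELDS = (
    ("time_date", 8),
    ("spatial_coordinates", 12),
    ("country", 4),
    ("organisation", 4),
    ("user", 4),
)


def split_fields(data, fields):
    """Cut `data` into consecutive named byte fields."""
    out, offset = {}, 0
    for name, width in fields:
        out[name] = data[offset:offset + width]
        offset += width
    return out


def parse_extended_umid(raw):
    """Split a 64-byte extended UMID into basic and signature fields."""
    if len(raw) != 64:
        raise ValueError("an extended UMID is 32 + 32 = 64 bytes")
    parsed = split_fields(raw[:32], BASIC_FIELDS)
    parsed.update(split_fields(raw[32:], SIGNATURE_FIELDS))
    return parsed
```

Both halves sum to 32 bytes (12 + 1 + 3 + 16 and 8 + 12 + 4 + 4 + 4), which is consistent with the sizes given for the basic UMID and the signature metadata.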

[0016] Apart from the basic metadata described above which serves to
identify properties of the recording itself, additional metadata is
provided which describes in detail the contents of the recorded audio
data and video images. This additional metadata comprises
"feature-vectors", preferably on a frame-by-frame basis, and is
generated by hardware in the camera 10 by processing the raw video and
audio data in real time as (or immediately after) it is captured.
[0017] The feature vectors could for example supply data to indicate if
a given frame has speech associated with it and whether or not it
represents an image of a face. Furthermore the feature vectors could
include information about certain image properties such as the hue.

[0018] The main metadata, which includes a UMID and start/end timecodes,
could be recorded on videotape along with the audio and video data, but
preferably it will be stored using a proprietary system such as Sony's
"Tele-File®" system. Under this Tele-File system, the metadata is stored
in a contact-less memory integrated circuit contained within the
video-cassette label which can be read, written and rewritten with no
direct electrical contact to the label.

[0019] All of the metadata information is transferred to a metastore 20
along a metadata data path 15 which could represent videotape, a
removable hard disk drive or a wireless local area network (LAN). The
metastore 20 has a storage capacity 30 and a central processing unit 40
which performs calculations to effect full metadata extraction and
analysis. The metastore 20 uses the feature-vector metadata: to automate
functions such as sub-shot segmentation; to identify footage likely to
correspond to an interview as indicated by the simultaneous detection of
a face and speech in a series of contiguous frames; to produce
representative images for use in an off-line editing system which
reflect the predominant overall contents of each shot; and to calculate
properties associated with encoding of the audio and video information.

[0020] Thus the metadata feature-vector information affords automated
processing of the audio and video data prior to editing. Metadata
describing the contents of the audio and video data is centrally stored
in the metastore 20 and it is linked to the associated audio and video
data by a unique identifier such as the SMPTE UMID. The audio and video
data will generally be stored independently of the metadata. The use of
the metastore makes feature-vector data easily accessible and provides a
large information storage capacity.

[0021] The metastore also performs additional processing of
feature-vector data, automating many processes that would otherwise be
performed by the editor. The processed feature-vector data is
potentially available at the beginning of the off-line editing process,
which should result in a much more efficient and less time-consuming
editing operation.

[0022] Figure 2 illustrates schematically how the main components of the
video camera 10 and the metastore 20 interact according to embodiments
of the invention. An image pickup device 50 generates audio and video
data signals 55 which it feeds to an image processing module 60. The
image processing module 60 performs standard image processing operations
and outputs processed audio and video data along a main data path 85.
The audio and video data signals 55 are also fed to a feature extraction
module 80 which performs processing operations such as speech detection
and hue histogram calculation, and outputs feature-vector data 95. The
image pickup device 50 supplies a signal 65 to a metadata generation
unit 70 that generates the basic metadata information 75 which includes
a basic UMID and start/end timecodes. The basic metadata information and
the feature-vector data 95 are multiplexed and sent along a metadata
data path 15.

[0023] The metadata data path is directed into a metadata extraction
module 90 located in the metastore 20. The metadata extraction module 90
performs full metadata extraction and uses the feature-vector data 95
generated in the video camera to perform additional data processing
operations to produce additional information about the content of the
recorded sound and images. For example the hue feature vectors (i.e.
additional metadata) can be used by the metadata extraction module 90 to
perform sub-shot segmentation. This process will be described below. The
output data 115 of the metadata extraction module 90 is recorded in the
main storage area 30 of the metastore 20 where it can be retrieved by an
off-line editing apparatus.
[0024] Figure 3 is a schematic diagram of a feature extraction module
and a metadata extraction module according to embodiments of the
invention.
[0025] As mentioned above, the left hand side of Figure 3 shows that the
feature extraction module 80 of the video camera 10 comprises a hue
histogram calculation unit 100, a speech detection unit 110 and a face
detection unit 120. The outputs of these feature extraction units are
supplied to the metadata extraction module 90 for further processing.

[0026] The hue histogram calculation unit 100 performs an analysis of
the hue values of each image. Image pick-up systems in a camera detect
primary-colour red, green and blue (RGB) signals. These signals are
format-converted and stored in a different colour space representation.
On analogue video tape (such as PAL and NTSC) the signals are stored in
YUV space whereas digital video systems store the signals in the
standard YCrCb colour space. A third colour space is
hue-saturation-value (HSV). The hue reflects the dominant wavelength of
the spectral distribution, the saturation is a measure of the
concentration of a spectral distribution at a single wavelength and the
value is a measure of the intensity of the colour. In the HSV colour
space hue specifies the colour in a 360° range.

[0027] The hue histogram calculation unit 100 performs, if so required,
the conversion of audio and video data signals from an arbitrary colour
space to the HSV colour space. The hue histogram calculation unit 100
then combines the hue values for the pixels of each frame to produce for
each frame a "hue histogram" of frequency of occurrence as a function of
hue value. The hue values are in the range 0° ≤ hue < 360° and the
bin-size of the histogram, although potentially adjustable, would
typically be 1°. In this case a feature vector with 360 elements will be
produced for each frame. Each element of the hue feature vector will
represent the frequency of occurrence of the hue value associated with
that element. Hue values will generally be provided for every pixel of
the frame but it is also possible that a single hue value will be
derived (e.g. by an averaging process) corresponding to a group of
several pixels. The hue feature-vectors can subsequently be used in the
metadata extraction module 90 to perform sub-shot segmentation and
representative image extraction.

[0028] The speech detection unit 110 in the feature extraction module 80
performs an analysis of the recorded audio data. The speech detection
unit 110 performs a spectral analysis of the audio material, typically
on a frame-by-frame basis. In this context, the term "frame" refers to
an audio frame of perhaps 40 milliseconds duration and not to a video
frame. The spectral content of each audio frame is established by
applying a fast Fourier transform (FFT) to the audio data using either
software or hardware. This provides a profile of the audio data in terms
of power as a function of frequency.

[0029] The speech detection technique used in this embodiment exploits
the fact that human speech tends to be heavily harmonic in nature. This
is particularly true of vowel sounds. Although different speakers have
different pitches in their voices, which can vary from frame to frame,
the fundamental frequencies of human speech will generally lie in the
range from 50-250 Hz. The content of the audio data is analysed by
applying a series of "comb filters" to the audio data. A comb filter is
an Infinite Impulse Response (IIR) filter that routes the output samples
back to the input after a specified delay time. The comb filter has
multiple relatively narrow pass-bands, each having a centre frequency at
an integer multiple of the fundamental frequency associated with the
particular filter. The output of the comb filter based on a particular
fundamental frequency provides an indication of how heavily the audio
signal in that frame is harmonic about that fundamental frequency. A
series of comb filters with fundamental frequencies in the range
50-250 Hz is applied to the audio data.

[0030] When an FFT process is applied to the audio material first, as in
this embodiment, the comb filter is conveniently implemented as a simple
selection of certain FFT coefficients.

[0031] The sliding comb filter thus gives a quasi-continuous series of
outputs, each indicating the harmonic content of the audio signal for a
particular fundamental audio frequency. Within this series of outputs,
the maximum output is selected for each audio frame. This maximum output
is known as the "Harmonic Index" (HI) and its value is compared with a
predetermined threshold to determine whether or not the associated audio
frame is likely to contain speech.

[0032] The speech detection unit 110 located in the feature extraction
module 80 produces a feature-vector for each audio frame. In its most
basic form this is a simple flag that indicates whether or not speech is
present. Data corresponding to the harmonic index for each frame could
also potentially be supplied as feature-vector data. Alternative
embodiments of the speech detection unit 110 might output a
feature-vector comprising the FFT coefficients for each audio frame, in
which case the processing to determine the harmonic index and the
likelihood of speech being present would be carried out in the metadata
extraction module 90. The feature extraction module 80 could include an
additional unit 130 for audio frame processing to detect musical
sequences or pauses in speech.
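
The harmonic index calculation of paragraphs [0029] to [0031] can be sketched as follows. This is only an illustration under stated assumptions: the comb is realised by picking the spectral coefficient nearest each of the first few multiples of a candidate fundamental, the comb output is normalised by the total frame power, and the candidate spacing, number of harmonics and threshold are all hypothetical values not given in the description. A naive DFT stands in for the FFT so that the sketch needs only the standard library.

```python
import cmath
import math


def power_spectrum(frame):
    """Power at DFT bins 0..N/2 (naive DFT; an FFT yields the same values)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def harmonic_index(frame, sample_rate, f0_min=50.0, f0_max=250.0,
                   step=5.0, n_harmonics=8):
    """Maximum comb output over candidate fundamentals in [f0_min, f0_max].

    ASSUMPTIONS: `step`, `n_harmonics` and the normalisation by total
    power (DC excluded) are illustrative choices, not from the source.
    """
    power = power_spectrum(frame)
    total = sum(power[1:]) or 1.0          # avoid division by zero
    bin_hz = sample_rate / len(frame)
    best = 0.0
    n_candidates = int((f0_max - f0_min) / step) + 1
    for k in range(n_candidates):
        f0 = f0_min + k * step
        # The "comb": select coefficients at integer multiples of f0.
        bins = {round(h * f0 / bin_hz) for h in range(1, n_harmonics + 1)}
        comb = sum(power[b] for b in bins if 0 < b < len(power))
        best = max(best, comb / total)
    return best


def contains_speech(frame, sample_rate, threshold=0.5):
    """Speech flag: harmonic index compared with a (hypothetical) threshold."""
    return harmonic_index(frame, sample_rate) > threshold
```

For a 40 millisecond frame sampled at 8 kHz (320 samples, 25 Hz bin spacing), a tone at 200 Hz with two overtones yields an index close to one, whereas broadband noise spreads its power across all bins and yields a much smaller value.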
[0033] The face detection unit 120 located in the feature extraction
module 80 analyses video images to determine whether or not a human face
is present. This unit implements an algorithm to detect faces such as
the FaceIt® algorithm produced by the Visionics Corporation and
commercially available at the priority date of this patent application.
This face detection algorithm uses the fact that all facial images can
be synthesised from an irreducible set of building elements. The
fundamental building elements are derived from a representative ensemble
of faces using statistical techniques. There are more facial elements
than there are facial parts. Individual faces can be identified by the
facial elements they possess and by their geometrical combinations. The
algorithm can map an individual's identity into a mathematical formula
known as a "faceprint". Each facial image can be compressed to produce a
faceprint of around 84 bytes in size. The face of an individual can be
recognised from this faceprint regardless of changes in lighting or skin
tone, facial expressions or hairstyle and in the presence or absence of
spectacles. Variations in the angle of the face presented to the camera
can be up to around 35° in all directions and movement of faces can be
tolerated.

[0034] The algorithm can therefore be used to determine whether or not a
face is present on an image-by-image basis and to determine a sequence
of consecutive images in which the same faceprint appears. The
software supplier asserts that faces which occupy as lit- 
tle as 1% of the image area can be recognised using the 
algorithm. 

[0035] The face detection unit 120 outputs basic feature-vectors 155 for
each image comprising a simple flag to indicate whether or not a face
has been detected in the respective image. Furthermore, the faceprint
data for each of the detected faces is output as feature-vector data
155, together with a key or lookup table which relates each image in
which at least one face has been detected to the corresponding detected
faceprint(s). This data will ultimately provide the editor with the
facility to search through and select all of the recorded video images
in which a particular faceprint appears.
[0036] The right hand side of Figure 3 shows that the metadata
extraction module 90 of the video camera 10 comprises a representative
image extraction unit 150, an "activity" calculation unit 160, a
sub-shot segmentation unit 170 and an interview detection unit 180.
[0037] The representative image extraction unit 150 uses the feature
vector data 155 for the hue image property to extract a representative
image which reflects the predominant overall contents of a shot. The hue
histogram data included in feature-vector data 155 comprises a hue
histogram for each image. This feature-vector data is combined with the
segmentation information output by sub-shot segmentation unit 170 to
calculate the average hue histogram data for each shot.

[0038] The hue histogram information for each frame of the shot is used
to determine an average histogram for the shot according to the formula:

$$h'_i = \frac{1}{n_F} \sum_{F=1}^{n_F} h_{i,F}$$

where $i$ is an index for the histogram bins, $h'_i$ is the average
frequency of occurrence of the hue value associated with the ith bin,
$h_{i,F}$ is the hue value associated with the ith bin for frame $F$ and
$n_F$ is the number of frames in the shot. If the majority of the frames
in the shot correspond to the same scene then the hue histograms for
those frames will be similar in shape; therefore the average hue
histogram will be heavily weighted to reflect the hue profile of that
predominant scene.

[0039] The representative image is extracted by performing a comparison
between the hue histogram for each frame of a shot and the average hue
histogram for that shot. A single valued difference $\mathrm{diff}_F$ is
calculated according to the formula:

$$\mathrm{diff}_F = \sqrt{\sum_{i=1}^{nbins} \left(h'_i - h_{i,F}\right)^2}$$

[0040] Of the frames $F$ ($1 \le F \le n_F$) of a shot, the one frame is
selected which has the minimum value of $\mathrm{diff}_F$. The above
formula represents the preferred method for calculating the single
valued difference; however it will be appreciated that alternative
formulae can be used to achieve the same effect. An alternative would be
to sum the absolute value of the difference $(h'_i - h_{i,F})$, to form
a weighted sum of differences or to combine difference values for each
image property of each frame. The frame with the minimum difference will
have the hue histogram closest to the average hue histogram and hence it
is preferably selected as the representative keystamp (RKS) image for
the associated shot. The frame for which the minimum difference is
smallest can be considered to have the hue histogram which is closest to
the average hue histogram. If the value of the minimum difference is the
same for two frames or more in the same shot then there are multiple
frames which are closest to the average hue histogram; however the first
of these frames can be selected to be the representative keystamp.
Although preferably the frame with the hue histogram that is closest to
the average hue histogram is selected to be the RKS, alternatively an
upper threshold can be defined for the single valued difference such
that the first frame in the temporal sequence of the shot having a
minimum difference which lies below the threshold is selected as an RKS.
It will be appreciated that, in general, any frame of the shot having a
minimum difference which lies below the threshold could be selected as
an RKS. The RKS images are the output of representative image extraction
unit 150.
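
The selection of a representative keystamp described in paragraphs [0037] to [0040] can be sketched as below. The sketch is illustrative only: it assumes each frame is available as a list of 8-bit RGB pixels, it uses the standard library `colorsys` conversion to obtain hue, and it takes the single valued difference to be a Euclidean distance between histograms. All function names are hypothetical.

```python
import colorsys
import math


def hue_histogram(pixels, nbins=360):
    """Count pixels per hue bin; `pixels` holds (r, g, b) tuples in 0..255."""
    hist = [0] * nbins
    for r, g, b in pixels:
        hue, _, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)  # [0, 1)
        hist[int(hue * nbins) % nbins] += 1
    return hist


def average_histogram(histograms):
    """Bin-by-bin mean of the per-frame histograms of one shot."""
    n_frames = len(histograms)
    return [sum(column) / n_frames for column in zip(*histograms)]


def single_valued_difference(hist, avg):
    """ASSUMED form: Euclidean distance between a frame and the average."""
    return math.sqrt(sum((a - h) ** 2 for a, h in zip(avg, hist)))


def representative_frame(histograms):
    """Index of the frame closest to the shot average (first one on a tie)."""
    avg = average_histogram(histograms)
    diffs = [single_valued_difference(h, avg) for h in histograms]
    return diffs.index(min(diffs))  # list.index returns the first minimum
```

Because only the ordering of the differences matters, replacing the Euclidean distance with a sum of absolute differences would change the numerical values but would often select the same frame.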

[0041] The RKS images can be used in the off-line edit suite as
thumbnail images to represent the overall predominant contents of the
shots. The editor can see the RKS at a glance and its availability will
reduce the likelihood of having to replay a given shot in real time.
[0042] The "activity" calculation unit 160 uses the hue feature-vector
data generated by the hue histogram calculation unit 100 to calculate an
activity measure for the captured video images. The activity measure
gives an indication of how much the image sequence changes from frame to
frame. It can be calculated on a global level such as across the full
temporal sequence of a shot or at a local level with respect to an image
and its surrounding frames. In this embodiment the activity measure is
calculated from the local variance in the hue values. It will be
appreciated that the local variance of other image properties such as
the luminosity could alternatively be used to obtain an activity
measure. The advantage of using the hue is that the variability in the
activity measure due to changes in lighting conditions is reduced. A
further alternative would be to use the motion vectors to calculate an
activity measure.
[0043] The activity measure data output by the activity calculation unit
will subsequently be used by the off-line editing apparatus and
metadata-enabled devices such as video tape recorders and digital video
disk players to provide the viewer of recorded video images with a
"video skim" and an "information shuttle" function.
[0044] The video skim function is an automatically generated accelerated
replay of a video sequence. During the accelerated replay, sections in
the temporal sequence of images for which the activity measure is below
a predetermined threshold are either replayed in fast shuttle or are
skipped over completely.

[0045] The information shuttle function provides a mapping between
settings on a user control (such as a dial on a VTR) and the information
presentation rate determined from the activity measure of the video
images. This differs from a standard fast forward function which simply
maps settings on the user control to the video replay rate and takes no
account of the content of the images being replayed.
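
The video skim function of paragraph [0044] can be illustrated with a short sketch. It assumes that one activity value is available per frame and groups consecutive frames into runs that are replayed normally, replayed in fast shuttle, or skipped; the function name, the mode labels and the `skip` switch are hypothetical.

```python
def skim_plan(activity, threshold, skip=False):
    """Group frames into (start, end, mode) runs; `end` is exclusive.

    Frames whose activity is below `threshold` are marked "fast"
    (fast shuttle) or, if `skip` is True, "skip"; all others "normal".
    """
    low_mode = "skip" if skip else "fast"
    plan = []
    for i, value in enumerate(activity):
        mode = "normal" if value >= threshold else low_mode
        if plan and plan[-1][2] == mode:
            # Extend the current run instead of starting a new one.
            plan[-1] = (plan[-1][0], i + 1, mode)
        else:
            plan.append((i, i + 1, mode))
    return plan
```

A replay device could then step through the returned runs, choosing the replay speed for each run from its mode.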

[0046] The "activity" calculation unit 160 also serves to measure the
activity level in the audio signal associated with the video images. It
uses the feature-vectors produced by the speech detection unit 110 and
performs processing operations to identify temporal sequences of normal
speech activity, to identify pauses in speech and to distinguish speech
from background noise. The volume of the sound is also used to identify
high audio activity. This volume-based audio activity information is
particularly useful for identifying significant sections of the video
footage for sporting events where the level of interest can be gauged by
the crowd reaction.

[0047] The sub-shot segmentation module uses the feature vector data 155
for the hue image property to perform sub-shot segmentation. The
sub-shot segmentation is performed by calculating the element-by-element
difference between the hue histograms for consecutive images and by
combining these differences to produce a single valued difference. A
scene change is flagged by locating an image with a single valued
difference that lies above a predetermined threshold.
[0048] Similarly a localised change in the subject of a 
picture, such as the entry of an additional actor to a 
scene, can be detected by calculating the single-valued 
difference between the hue histogram of a given image 
and a hue histogram representing the average hue values of images from
the previous one second of video footage.
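
The two comparisons described in paragraphs [0047] and [0048] can be sketched as follows. The description does not say how the element-by-element differences are combined, so the sketch assumes a sum of absolute differences; the `window` parameter stands for the number of frames in one second (for example 25 for PAL) and, like the function names, is a hypothetical choice.

```python
def histogram_difference(a, b):
    """ASSUMED combination: sum of absolute bin-by-bin differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def scene_changes(histograms, threshold):
    """Indices of frames that differ from the preceding frame by > threshold."""
    return [i for i in range(1, len(histograms))
            if histogram_difference(histograms[i], histograms[i - 1]) > threshold]


def local_changes(histograms, threshold, window):
    """Compare each frame with the mean histogram of the previous `window` frames."""
    flagged = []
    for i in range(window, len(histograms)):
        recent = histograms[i - window:i]
        mean = [sum(column) / window for column in zip(*recent)]
        if histogram_difference(histograms[i], mean) > threshold:
            flagged.append(i)
    return flagged
```

Comparing against a one-second average rather than the immediately preceding frame makes the second test respond to changes that build up over several frames, such as a subject entering the picture.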

[0049] The interview detection unit 180 uses the feature-vector data 155
output by the feature extraction module 80 to identify images and
associated audio frames corresponding to interview sequences. In
particular, the interview detection unit 180 uses feature vector data
output by the speech detection unit 110 and the face detection unit 120
and combines the information in these feature vectors to detect
interviews. At a basic level the simple flags which identify the
presence/absence of speech and the presence/absence of at least one face
are used to identify sequences of consecutive images where both speech
and at least one face have been flagged. These shots are likely to
correspond to interview sequences.
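
The basic flag combination described above can be sketched as a search for runs of consecutive frames in which both flags are set. The sketch assumes that the speech and face flags have already been aligned so that there is one of each per image; the `min_length` parameter and the function name are hypothetical.

```python
def interview_sequences(speech_flags, face_flags, min_length=1):
    """Return (start, end) runs, `end` exclusive, where both flags are set."""
    runs, start = [], None
    for i, (speech, face) in enumerate(zip(speech_flags, face_flags)):
        if speech and face:
            if start is None:
                start = i               # a new candidate run begins here
        elif start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    n = min(len(speech_flags), len(face_flags))
    if start is not None and n - start >= min_length:
        runs.append((start, n))         # run extends to the last frame
    return runs
```

Raising `min_length` discards short coincidences of speech and a face, which may be useful where only sustained sequences are of interest.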

[0050] Once the shots associated with interviews 
have been flagged, the faceprint data of the feature vec- 
tors is subsequently used to identify participants in each 
interview. Furthermore the harmonic index audio data 
from the feature vectors could be used to help discriminate between the
voices of interviewer and interviewee. The interview detection unit 180
thus serves to identify the shots
associated with interviews and to provide the editor with 
the faceprints associated with the participants in each 
interview. 

[0051] Figure 4 shows a camera and a personal digital assistant
according to a first embodiment of the invention. The camera includes an
acquisition adapter
270 that performs functions associated with the down- 
stream audio and video data processing. The acquisi- 
tion adapter 270 illustrated in this particular embodiment 
is a distinct unit which interfaces with the camera via a 
built-in docking connector. However, it will be appreci- 
ated that the acquisition unit hardware could alternative- 
ly be incorporated in the main body of the camera. 
[0052] In the main body of the camera, the metadata generation unit 70
generates an output 205 which includes a basic UMID and in/out timecodes
per shot. The output 205 of the metadata generation unit 70 is fed as
input to a video storage and retrieval module 200 that stores the main
metadata and the audio and video data recorded by the camera. The main
metadata 205 could be stored on the same videotape as that on which the
audio and video data is stored or it could be stored separately, for
example, on a memory integrated circuit formed as part of a cassette
label.

[0053] The audio and video data and the basic meta- 
data 205 are output as an unprocessed data signal 215 
which is supplied to the acquisition adapter unit 270 of 
the camera 10. The unprocessed data signal 215 is input to a feature
vector generation module 220 which
processes the audio and video data frame-by-frame and 
generates feature vector data which characterises the 
contents of the respective frame. The output 225 of the 
feature vector generation module 220 includes the au- 
dio data, the video images, the main metadata and the 
feature-vector data. All of this data is provided as input 
to a metadata processing module 230. 
[0054] The metadata processing module 230 generates
the 32 bytes of signature metadata for the extended
UMID. This module performs processing of the feature
vector data, such as analysis of the hue vectors to
select an image from a shot which is representative of
the predominant overall contents of the shot. The hue
feature-vectors can also be used for performing sub-shot
segmentation. In this particular embodiment, the
processing of feature-vectors is performed in the camera
acquisition unit 270, but it will be appreciated that
this processing could alternatively be performed in the
metastore 20. The output of the metadata processing
module 230 is a signal 235 comprising processed and
unprocessed metadata which is stored on a removable
storage unit 240. The removable storage unit 240 could
be a flash memory PC card or a removable hard disk
drive.
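
As a minimal sketch of the kind of processing described in paragraphs [0053] and [0054], the following Python fragment derives a hue histogram for each frame and selects, as the representative image, the frame whose histogram lies closest to the average over the shot. The histogram size, the L1 distance, the closest-to-mean selection rule and all function names are assumptions made for illustration; the specification does not state which measure the metadata processing module 230 applies.

```python
import colorsys
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit (R, G, B); a frame is a flat sequence of pixels


def hue_histogram(frame: Sequence[Pixel], bins: int = 16) -> List[float]:
    """Return a normalised hue histogram used here as the frame's feature vector.

    Assumption: grey pixels, which have no defined hue, fall into bin 0
    because colorsys reports their hue as 0.0.
    """
    counts = [0] * bins
    for r, g, b in frame:
        hue, _sat, _val = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        counts[min(int(hue * bins), bins - 1)] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def representative_frame(features: Sequence[Sequence[float]]) -> int:
    """Index of the frame whose feature vector is nearest (L1) to the shot mean."""
    n = len(features)
    mean = [sum(column) / n for column in zip(*features)]

    def distance(vector: Sequence[float]) -> float:
        return sum(abs(a - b) for a, b in zip(vector, mean))

    return min(range(n), key=lambda i: distance(features[i]))


RED, BLUE = (255, 0, 0), (0, 0, 255)
shot = [[RED] * 4, [RED] * 3 + [BLUE], [BLUE] * 4]  # three tiny 4-pixel frames
features = [hue_histogram(frame) for frame in shot]
best = representative_frame(features)  # the mixed frame is nearest the average
```

Under the same assumptions, a sub-shot boundary could be flagged wherever the distance between the histograms of successive frames exceeds a chosen threshold.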

[0055] The metadata is preferably stored on the removable
storage unit 240 in a format such as extensible
markup language (XML) that facilitates selective
context-dependent data retrieval. This selective data
retrieval is achieved by defining custom 'tags' which mark
sections in the XML document according to special
categories such as metadata objects and metadata tracks.
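
The selective retrieval described in paragraph [0055] might look like the following sketch, which uses Python's standard `xml.etree.ElementTree` module. The document layout, including the tag names `metadata_object` and `metadata_track` and the `type` and `frame` attributes, is hypothetical; the specification names the categories but does not define a schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document layout: every tag and attribute name below is an
# illustrative assumption, not a format defined by the specification.
METADATA_XML = """
<shot id="example-shot-1">
  <metadata_object type="title">Interview, take 2</metadata_object>
  <metadata_object type="interviewee">A. N. Other</metadata_object>
  <metadata_track type="hue">
    <sample frame="0">0.12</sample>
    <sample frame="1">0.14</sample>
  </metadata_track>
</shot>
"""


def select_objects(xml_text: str, kind: str) -> list:
    """Return the text of every metadata object of the requested type."""
    root = ET.fromstring(xml_text.strip())
    return [el.text for el in root.findall(f"./metadata_object[@type='{kind}']")]


def select_track(xml_text: str, kind: str) -> list:
    """Return (frame, value) pairs for one metadata track, ignoring the rest."""
    root = ET.fromstring(xml_text.strip())
    samples = root.findall(f"./metadata_track[@type='{kind}']/sample")
    return [(int(s.get("frame")), float(s.text)) for s in samples]


titles = select_objects(METADATA_XML, "title")
hue_track = select_track(METADATA_XML, "hue")
```

Because each query names the category it wants, a client need only extract the tagged sections relevant to its current task.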
[0056] In this embodiment the removable metadata
storage unit 240 can be physically removed from the video
camera and plugged directly into the acquisition PDA
300 where the metadata can be viewed and edited. 
[0057] The unprocessed data signal 215 generated
by the main camera unit, which includes the recorded
basic audio and video data, apart from being supplied
to the feature vector generation module, is also supplied
to an AV proxy generation module 210 located in the
acquisition adapter 270. The AV proxy generation module
210 produces a low bit-rate copy of the high bit-rate
broadcast quality video and audio data signal 215
produced by the camera 10.

[0058] The AV proxy is required because the video bit
rate of high-end equipment such as professional digital
Betacam cameras is [...] per second and this data rate
is likely to be too high to be appropriate for use by
low-end equipment such as desktop PCs and PDAs. The AV
proxy generator 210 performs strong data compression to
make a comparatively low (e.g. around 4 Mbits/sec)
bit-rate copy of the master material. An AV proxy output
signal 245 comprises low bit-rate video images and audio
data. The low bit-rate AV proxy, although not of
broadcast quality, is of sufficient resolution for use
in browsing the recorded footage and for making off-line
edit decisions. The AV proxy output 245 is stored
alongside the metadata 235 on the removable storage unit
240. The AV proxy can be viewed on the acquisition PDA
300 by transferring the removable storage unit 240 from
the acquisition adapter 270 to the PDA 300.

[0059] Figure 5 shows a camera and a PDA according
to a second embodiment of the invention. Many of the
modules in this embodiment are identical to those in the
embodiment corresponding to Figure 4. A description of
the functions of these common modules can be found
in the above description of Figure 4 and shall not be
repeated here.

[0060] The embodiment of the invention shown in Figure 5
has an additional optional component located in
the acquisition adapter 270. This is a GPS receiver 250.
The GPS receiver 250 outputs a spatial co-ordinate data
signal 255 as required for generation of the signature
metadata component of the extended UMID. The signature
metadata is generated in the metadata processing
module 230. Essentially, the GPS co-ordinates of the
camera serve as a form of identification for the recorded
material. It will be appreciated that the GPS receiver 250
could also be optionally included in the embodiment of
Figure 4.

[0061] The main distinction of the second embodiment
illustrated in Figure 5 with respect to the first
embodiment of Figure 4 is that it comprises a wireless
network interface PC card together with aerials
280A on the camera and 280B on the PDA. This reflects
the fact that in this embodiment the acquisition adapter
270 is connected to the acquisition PDA by a wireless
local area network (LAN).

[0062] The wireless LAN (wireless 802.11b with
10/100 base-T) can typically provide a link within a 50
metre range and with a data capacity of around 11 Mbits/sec.
A broadcast quality image has a bandwidth of
around 1 Mbit/image; therefore it would be ineffective to
transmit broadcast quality video footage across the
wireless LAN. However, the reduced bandwidth AV
proxy may be transmitted effectively to the PDA across
the wireless link.

[0063] The removable storage unit 240 can also be
used to physically transfer data between the acquisition
adapter and the PDA, but without the wireless LAN link
metadata annotations cannot be transferred while the
camera is recording, because during recording the storage
unit 240 will be located in the camera. The wireless LAN
link between the camera 10 and the PDA has the additional
advantage over the embodiment of Figure 4 that
metadata annotations such as the name of an interviewee
or the title of a shot can be transferred from the PDA
to the camera while the video camera is still recording.
These metadata annotations could potentially be stored
on the removable storage unit 240 while it is still located
in the camera's acquisition adapter. The wireless LAN
connection should also allow low bit-rate versions of
recorded sound and images to be downloaded to the PDA while
the video camera is still running.
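
The bandwidth argument of paragraph [0062] can be checked with simple arithmetic, sketched below. The link capacity, the per-image figure and the proxy bit rate are taken from the text; the frame rate of 25 images per second is an assumption, since the text gives only a per-image figure, and the 11 Mbits/sec capacity is treated as fully usable, which is optimistic.

```python
LINK_CAPACITY_MBITS = 11.0       # from the text: around 11 Mbits/sec
BROADCAST_MBIT_PER_IMAGE = 1.0   # from the text: around 1 Mbit/image
PROXY_MBITS = 4.0                # from the text: around 4 Mbits/sec
FRAME_RATE = 25.0                # assumption: 25 images per second


def fits_link(stream_mbits: float, capacity_mbits: float = LINK_CAPACITY_MBITS) -> bool:
    """True if a stream of the given bit rate fits within the link capacity."""
    return stream_mbits <= capacity_mbits


# Broadcast-quality video: per-image size multiplied by the assumed frame rate.
broadcast_mbits = BROADCAST_MBIT_PER_IMAGE * FRAME_RATE

broadcast_ok = fits_link(broadcast_mbits)  # exceeds the link capacity
proxy_ok = fits_link(PROXY_MBITS)          # well within the link capacity
```

At the assumed frame rate the broadcast-quality stream needs more than twice the stated link capacity, whereas the AV proxy uses well under half of it.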
[0064] If the metadata and AV proxy are stored in the
removable storage unit 240 in a format such as XML
then the PDA 300 can selectively retrieve data from the
XML data files in the camera to avoid wasting precious
bandwidth.

[0065] Figure 6 is a schematic diagram illustrating the 
components of the personal digital assistant 300 ac- 
cording to embodiments of the invention. The PDA op- 
tionally comprises a wireless network interface PC card 
and the aerial 280B to enable connectivity via the wire-
less LAN. The PDA 300 optionally comprises a web
browser 350 which would provide access to data on the
internet.

[0066] The metadata annotation module allows the 
user of the PDA to generate metadata to annotate the 
recorded audio and video footage. Such annotations 
might include the names and credentials of actors; details
of the camera crew; camera settings; and shot titles.

[0067] An AV proxy viewing module 320 provides the 
facility to view the low-bit-rate copy of the master record- 
ing generated by the acquisition adapter. The AV proxy 
viewing module 320 will typically include offline editing 
functions to allow basic editing decisions to be made using
the PDA and to record these as an edit decision list
for use in on-line editing. The PDA 300 also includes a 
camera set-up and control module 330 which would give 
the user of the PDA the power to change the orientation 
or the settings of the camera remotely. The removable
storage unit 240 can be used to transfer audio-visual
data and metadata between the camera 10 and the PDA.
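
The edit decision list recorded by the AV proxy viewing module 320 identifies edit points by their timecode addresses. A minimal sketch of such a list follows. The `HH:MM:SS:FF` timecode layout, the rate of 25 frames per second, and the `EditEvent` and `timecode_to_frames` names are assumptions for illustration only; the specification does not define a file format.

```python
from dataclasses import dataclass

FPS = 25  # assumption: 25 frames per second, non-drop-frame timecode


def timecode_to_frames(timecode: str, fps: int = FPS) -> int:
    """Convert an 'HH:MM:SS:FF' timecode into an absolute frame count."""
    hours, minutes, seconds, frames = (int(part) for part in timecode.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * fps + frames


@dataclass
class EditEvent:
    """One entry of a hypothetical edit decision list: an in/out pair."""
    source_in: str
    source_out: str

    def duration_frames(self, fps: int = FPS) -> int:
        return timecode_to_frames(self.source_out, fps) - timecode_to_frames(
            self.source_in, fps
        )


# Two selected segments of the recorded material.
edl = [
    EditEvent("00:00:10:00", "00:00:12:12"),
    EditEvent("00:01:00:00", "00:01:01:00"),
]
total_frames = sum(event.duration_frames() for event in edl)
```

Because the list holds only timecodes, decisions made on the low bit-rate proxy can later be applied to the master recording, provided both share the same timecode addresses.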

[0068] Figure 7 is a schematic diagram of an audio 
and video information processing and distribution sys- 
tem according to embodiments of the invention. The 
backbone of the system is the network 400 which could 
be a local network such as an intranet or even an inter- 
net connection. 

[0069] The camera 10 is connected to the PDA 300
via a wireless LAN and/or by the removable storage
medium 240. The camera and PDA are each in communication
with the metastore 20 via the network 400. A
metadata enhanced device 410, which could be a video
tape recorder or off-line editing apparatus, has access
to the metastore 20. Any number of these metadata
enhanced devices could be connected to the network 400.
This audio and video information processing and
distribution system should enable remote access to all
metadata deposited in the metastore 20. Thus the metadata
associated with given audio data and video images stored
on videotape could be identified via the UMID and
downloaded from the metastore via the network 400.

1. A camera-recorder apparatus comprising:

an image capture device operable to capture a
plurality of video images;
a storage medium by which the video images
are stored for later retrieval;
a feature extraction unit operable to derive image
property data from the image content of at
least one of the video images substantially in
real time at the capture of the video images, the
image property data being associated with respective
images or groups of images; and
a data path by which the camera-recorder apparatus
is operable to transfer the derived image
property data to an external data processing
apparatus.

2. Apparatus according to claim 1, in which:

the camera-recorder apparatus comprises
means for capturing an audio signal associated
with the video images; and
the feature extraction unit is operable to derive
audio property data for portions of the audio
signal associated with at least one of the video
images.

3. Apparatus according to claim 1 or claim 2, in which
the image property data is generated for every video
image.

4. Apparatus according to any one of the above
claims, comprising a proxy generator operable to
compress the video images to produce a lower
bit-rate copy of the respective images.

5. Apparatus according to any one of the
above claims in which the data path comprises a
removable storage medium for storing the image
property data.

6. Apparatus according to any one of
claims 1 to 5, in which the data path comprises a
wireless network connection device and an antenna
which are operable to provide a wireless link to the
external data processing apparatus.

7. Apparatus according to any one of the preceding
claims, in which the image property data comprises
at least one class of data selected from:

colour distribution data;
face recognition data; and
image activity data.

8. A method of image acquisition at a camera-recorder 
apparatus, the method comprising the following 
steps by the camera-recorder apparatus:

capturing a plurality of video images;
storing the video images for later retrieval;
deriving image property data from the image
content of at least one of the video images substantially
in real time at the capture of the video
images, the image property data being associated
with respective images or groups of images; and
transferring, via a data path, the derived image
property data to an external data processing
apparatus.

9. Computer software having program code for carrying
out a method according to claim 8 or claim 9.

10. A data providing medium by which computer soft- 
ware according to claim 10 is provided. 

11. A medium according to claim 11, the medium being
a transmission medium. 

12. A medium according to claim 11, the medium being 
a storage medium. 